The Three Pillars of the Transformer Architecture
The evolution of large language models marks a paradigm shift: from task-specific models to "unified pre-training," where a single architecture can adapt to many natural language processing needs.
At the core of this shift is the self-attention mechanism, which lets the model weigh the importance of different tokens in a sequence:
$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
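The formula above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration with no masking; the shapes and variable names are assumptions for this example, not a production implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) similarity
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                            # weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Each output row is a convex combination of the value vectors, with weights determined by how strongly that query matches each key.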
1. Encoder-only (BERT)
- Mechanism: masked language modeling (MLM).
- Behavior: bidirectional context; the model "sees" the entire sentence at once in order to predict the masked tokens.
- Best suited for: natural language understanding (NLU), sentiment analysis, and named entity recognition (NER).
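The MLM corruption step can be sketched as follows: randomly replace a fraction of tokens with a mask symbol and record the originals as prediction targets, which the model must recover using context from both directions. The 15% rate and token strings below are illustrative assumptions:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Toy MLM corruption: returns (masked_tokens, targets), where targets
    hold the original token at masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token...
            targets.append(tok)   # ...and ask the model to predict it
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
print(masked)
```

Because the model conditions on both the left and right context of each `[MASK]`, the learned representations are bidirectional, which is exactly what makes them strong for NLU but unsuitable for left-to-right generation.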
2. Decoder-only (GPT)
- Mechanism: autoregressive modeling.
- Behavior: left-to-right processing; the model predicts the next token strictly from the preceding context (causal masking).
- Best suited for: natural language generation (NLG) and creative writing. This is the foundation of modern LLMs such as GPT-4 and Llama 3.
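Causal masking is implemented by zeroing out (setting to negative infinity before the softmax) every attention score that points to a future position. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((4, 4))               # dummy attention scores
scores[~causal_mask(4)] = -np.inf       # future positions become -inf
print(scores)                           # upper triangle is -inf
```

After the softmax, the `-inf` entries become zero weight, so each token's representation depends only on itself and earlier tokens. This is what makes autoregressive generation (and KV caching) possible.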
3. Encoder-Decoder (T5)
- Mechanism: Text-to-Text Transfer Transformer.
- Behavior: the encoder converts the input string into a dense representation, and the decoder generates the target string from it.
- Best suited for: translation, summarization, and other sequence-to-sequence mapping tasks.
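T5's text-to-text framing reduces every task to string-in, string-out by prepending a task prefix to the input. The sketch below follows that convention; the exact prefix strings here are illustrative, not T5's canonical vocabulary:

```python
def to_text_to_text(task, text):
    """Frame a task in the T5 text-to-text style (prefixes are illustrative)."""
    prefixes = {
        "translate": "translate English to German: ",
        "summarize": "summarize: ",
    }
    return prefixes[task] + text  # every task becomes plain string -> string

print(to_text_to_text("summarize", "The Transformer relies on self-attention."))
```

Because input and output are always text, one model with one loss function can be trained on translation, summarization, and classification alike.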
Key Insight: Decoder Dominance
The industry has largely converged on decoder-only architectures, owing to their superior scaling laws and the emergent reasoning abilities they exhibit in zero-shot settings.
VRAM and the Context Window
In decoder-only models, the KV cache grows linearly with sequence length. A 100k-token context window requires dramatically more VRAM than an 8k window, which makes local deployment of long-context models very challenging without quantization.
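The linear growth is easy to estimate back-of-envelope: the cache stores one key and one value vector per layer, per head, per token. The layer/head counts below are roughly Llama-2-7B-like and fp16 is assumed; all numbers are illustrative:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, one vector per head per token."""
    return 2 * n_layers * seq_len * n_heads * head_dim * dtype_bytes

for ctx in (8_000, 100_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB")  # ~3.9 GiB vs ~48.8 GiB
```

Under these assumptions the jump from 8k to 100k tokens takes the cache from roughly 4 GiB to nearly 49 GiB, more than a single consumer GPU holds, which is why quantization and cache-compression techniques matter for long contexts.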
Question 1
Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?
Question 2
Which architecture treats every NLP task as a "text-to-text" problem?
Challenge: Architectural Bottlenecks
Analyze deployment constraints based on architecture.
If you are building a model for real-time document summarization where the input is very long, explain why a Decoder-only model might be preferred over an Encoder-Decoder model in modern deployments.
Step 1
Identify the architectural bottleneck regarding context processing.
Solution:
Encoder-Decoders must process the entire long input through the encoder, then perform cross-attention in the decoder, which can be computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process everything uniformly. With modern techniques like FlashAttention and KV Cache optimization, scaling the context window in a Decoder-only model is more streamlined and efficient for real-time generation.
Step 2
Justify the preference using Scaling Laws.
Solution:
Decoder-only models have demonstrated highly predictable performance improvements (Scaling Laws) when increasing parameters and training data. This massive scale unlocks "emergent abilities," allowing a single Decoder-only model to perform zero-shot summarization highly effectively without needing the task-specific fine-tuning often required by smaller Encoder-Decoder setups.